Tracking the Selective Pressure Profile and Gene Flow of SARS-CoV-2 Delta Variant in Italy from April to October 2021 and Frequencies of Key Mutations from Three Representative Italian Regions

The SARS-CoV-2 Delta variant of concern (VOC) was often associated with serious clinical course of the COVID-19 disease. Herein, we investigated the selective pressure, gene flow and evaluation on the frequencies of mutations causing amino acid substitutions in the Delta variant in three Italian regions. A total of 1500 SARS-CoV-2 Delta genomes, collected in Italy from April to October 2021 were investigated, including a subset of 596 from three Italian regions. The selective pressure and the frequency of amino acid substitutions and the prediction of their possible impact on the stability of the proteins were investigated. Delta variant dataset, in this study, identified 68 sites under positive selection: 16 in the spike (23.5%), 11 in nsp2 (16.2%) and 10 in nsp12 (14.7%) genes. Three of the positive sites in the spike were located in the receptor-binding domain (RBD). In Delta genomes from the three regions, 6 changes were identified as very common (>83.7%), 4 as common (>64.0%), 21 at low frequency (2.1%–25.0%) and 29 rare (≤2.0%). The detection of positive selection on key mutations may represent a model to identify recurrent signature mutations of the virus.


Introduction
The SARS-CoV-2 virus evolved rapidly with the emergence of new variants over time.Therefore, tracking the genome variability is essential to strengthen public health measures and preparedness, especially in the case of variants/mutations with possible impact on the transmissibility, severity and immunity [1,2].The epidemiological consequences of novel mutations are closely related to their impact on viral replication, transmission and on the competition between co-circulating viral strains.According to the Pangolin classification, the Delta variant consisted of about 245 different sublineages AY.x in addition to the parental strain B.1.617.2 ( [3], last access 5 October 2022).The SARS-CoV-2 Delta variant of concern (VOC) was dominant in Italy from mid-June until December 2021 [4,5].Subsequently, the Delta variant has been de-escalated by the Centers for Disease Control and Prevention (CDC, April 2022) and the World Health Organization (WHO June 2022), due to the almost exclusive circulation of the Omicron variant.The European Centre for Disease Prevention and Control (ECDC) de-escalated BA.1 and BA.3 (Omicron) on 12 August 2022 [6].At the time of writing, the BA.2, BA.4 and BA.5 (Omicron) were also de-escalated from the ECDC list of SARS-CoV-2 variants of concern (VOC) [7], as these lineages are no longer circulating in Europe.
Herein, we investigated the gene flows and selective pressure by a bioinformatic approach on the Delta variant circulating in Italy in 2021, since this VOC was often associated with serious clinical course of the COVID-19 disease [14].The frequency of key mutations localised in the positively selected sites identified in genomes from three representative Italian regions (Lazio, Sicily and Veneto) was also investigated.
Selective pressure is generally measured by the nonsynonymous/synonymous rate (dN/dS = ω), considering a nonsynonymous rate standing above the synonymous rate as evidence of selection [15].Thus, when ω > 1 the amino acid (aa) change offers a selective advantage and is fixed at a faster rate than a synonymous mutation, evidencing a diversifying selection (positive selective pressure) [16].Since the selective pressure profile of Delta variant in Italy remains poorly defined, this study can help to identify: (i) the positive and negative selection and the sites where they occur; (ii) the evolutionary dynamics and the recurrent mutations on those obtained from the three regions: Lazio, Sicily and Veneto; (iii) a pattern and compendium of mutations that need to be closely monitored, also on other future variants that will emerge [17], and stability of the proteins; (iv) a model to predict recurrent mutations.

Dataset and Sequence Alignment
A total of 1500 SARS-CoV-2 Delta variant genomes, collected in Italy from April to October 2021 (uploaded and analysed in the Italian COVID-19 Genomic I-Co-Gen national platform and deposited in GISAID) [18], were investigated.The dataset was built in relation to the total number of Delta genomes available at the 14 of October 2021.Specifically, 712 genomes from northern (47.4%), 259 from central (17.3%) and 529 (35.3%) from southern Italy were investigated.A total of 596 Delta genomes obtained from the above reported dataset were used to carry out an in-depth analysis, including sequences from three regions: north (Veneto), centre (Lazio) and south (Sicily) of Italy.These were used to estimate the genetic variability and the frequency of key mutations in the positively selected sites identified during the same study period.

Gene Flow and Migration Analysis
The MacClade version 4 program (Sinauer Associates, Sunderland, MA) was used to test gene out/in flow in Italy, among SARS-CoV-2 Delta variant sequences, applying a modified version of the Slatkin and Maddison test [23].A maximum likelihood (ML) phylogenetic tree was built using the IQ-TREE software v.1.6.12 [24] with the GTR model and used as the starting tree for this analysis.The ultrafast bootstrap approximation (UFBoot) and the SH-like approximate likelihood ratio test (SH-aLRT) were used for branch support values [25].A one-character data matrix was obtained from the dataset by assigning to each taxon in the tree a one-letter code indicating its own sampling location, according to the different geographic areas in Italy (north, centre and south).The putative origin of each ancestral sequence (i.e., internal node) in the tree was inferred by finding the most parsimonious reconstruction (MPR) of the ancestral character.The final tree length, which is the number of observed gene flow events in the genealogy, can easily be computed and compared to the tree-length distribution of 10,000 trees obtained by random joining-splitting (null distribution).Observed genealogies significantly shorter than random trees indicated the presence of subdivided populations with restricted gene flow.The gene flow among the different geographic areas (character states) was traced with the state changes and stasis tool through the MacClade software [23], which counts the number of changes in a tree for each pairwise character state.When multiple MPRs were present, the algorithm calculated the average migration count over all possible MPRs for each pair.

Selective Pressure Analysis
The selective pressure analysis was performed on the above reported SARS-CoV-2 protein-coding sequence subsets, with the aim to characterise the SARS-CoV-2 variations and the evolutionary dynamics in Italy, identifying the statistically supported positive and negative selective pressure sites.
A positive diversifying selection was inferred on sites statistically significant for a value of nonsynonymous to synonymous substitution ω > 1, while purifying selection was inferred for ω < 1 [26].On the contrary, neutrality was inferred for ω = 1 [26].
The fast unconstrained Bayesian approximation (FUBAR) and fixed effects likelihood (FEL) models were used [27,28] to identify selection under the HYPHY software v. 2.2.4 [29].The FUBAR method infers the nonsynonymous (dN) and synonymous (dS) substitution rates on a per-site basis in large datasets, based on the assumption that a pervasive selection pressure is constant in the entire phylogeny [27].
The FEL model uses a ML approach to infer dN and dS substitution rates on a persite basis for a given coding alignment and corresponding phylogeny [28].This method assumes that the selection pressure for each site is constant along the entire phylogeny.
Only the selective pressure sites confirmed by both FEL (p ≤ 0.05) and FUBAR (posterior probability ≥ 0.9) were reported as statistically supported.
The positions of the selective pressure sites and mutations in the different SARS-CoV-2 subsets were referred with respect to the protein products obtained from the SARS-CoV-2 reference Wuhan-Hu-1 (accession number: NC_045512.2).
The frequency of each amino acid substitution in the positively selected sites was calculated in the full dataset and in the subset of 596 SARS-CoV-2 Delta genomes from Lazio, Sicily and Veneto in order to classify them as very common, common, intermediate, at low frequency or rare.The prediction of the possible impact of the amino acid substitutions on the stability and structure of the protein was investigated through the I-Mutant 2.0 and PolyPhen-2 tools, respectively, as previously reported [30].

Gene Flow and Selective Pressure Analysis
The gene flow analysis performed according to the three geographic areas of Italy (north, centre and south), showed that most of the statistically supported gene flow events (36.1%) were identified from the north to the south; 6.7% of the supported gene flow events were found from the centre to the north; finally, 7.2% of supported gene flow was highlighted from the south to the centre of Italy (Figure 1).
The frequency of each amino acid substitution in the positively selected sites was calculated in the full dataset and in the subset of 596 SARS-CoV-2 Delta genomes from Lazio, Sicily and Veneto in order to classify them as very common, common, intermediate, at low frequency or rare.The prediction of the possible impact of the amino acid substitutions on the stability and structure of the protein was investigated through the I-Mutant 2.0 and PolyPhen-2 tools, respectively, as previously reported [30].

Gene Flow and Selective Pressure Analysis
The gene flow analysis performed according to the three geographic areas of Italy (north, centre and south), showed that most of the statistically supported gene flow events (36.1%) were identified from the north to the south; 6.7% of the supported gene flow events were found from the centre to the north; finally, 7.2% of supported gene flow was highlighted from the south to the centre of Italy (Figure 1).Overall the selective pressure showed considerable variation among the SARS-CoV-2 protein coding genes.The analysis of the Delta variant Italian dataset revealed 68 positively selected sites dispersed in the different protein coding genes, as shown in Table S1.More than 9 positively selected sites were identified in nsp2, nsp12 and spike (Table S1).In particular, 11 positively selected sites (16.2%) were identified in nsp2, 10 in nsp12 (14.7%) and 12 in the spike (17.6%).
The positively selected sites were further analysed to investigate the frequency of each amino acid replacement in our dataset (Table 1) in order to classify them as very common, common, at low frequency or rare.Six changes were identified as very common mutations (frequency > 83.7%), three substitutions were identified as common mutations (frequency > 64.0%), twenty-two mutations were identified at low frequency (between 2.1% and 25.0%) and fifty-three were rare (frequency ≤ 2.0%) (Table 1).Additionally, 85.7% of the amino acid replacements were predicted to decrease, 13.1% to increase and 1.2% not to change the stability of the protein (Table S2).
Overall, 29.8% of the amino acid changes were predicted by PolyPhen-2 as probably damaging the protein structure (score > 0.97), about 19.0% of the changes were predicted as possibly damaging, 48.8% as benign and, lastly, the probability of affecting protein structure was not known for 2.4% (Table S2).
The frequency of the key mutations in the positively selected sites in SARS-CoV-2 Delta genomes from Lazio, Sicily and Veneto altogether (Table 1) showed that 6 changes were identified as very common (frequency > 83.7%), 4 as common (frequency > 64.0%), 21 at low frequency (between 2.1% and 25.0%) and 29 were rare (frequency ≤ 2.0%) (Table 1).Twenty-four of the mutations in the positively selected sites previously reported in the full dataset (n = 1500) were not identified in the genomes from Lazio, Sicily and Veneto (n = 596).The frequency estimated separately for each selected region showed in Lazio 9 changes as very common (frequency > 83.7%), no common mutations (frequency > 64.0%), 16 changes at low frequency (between 2.1% and 25.0%), 1 intermediate and 22 were identified as rare (frequency ≤ 2.0%).Thirty-six of the mutations in positively selected sites, reported for the full dataset, were not identified in the Delta genomes from Lazio (Table 1).In Sicily, 7 changes were identified as very common mutations (frequency > 83.7%), 3 substitutions as common (frequency > 64.0%), 14 were identified at low frequency (between 2.1% and 25.0%), 3 intermediate and 21 were rare (frequency ≤ 2.0%) (Table 1).Thirty-six of the mutations in positively selected sites, reported for the full dataset, were not identified in the genomes from Sicily.Finally, in Veneto, 7 changes were identified as very common mutations (frequency > 83.7%), 3 as common mutations (frequency > 64.0%), 16 were identified at low frequency (between 2.1% and 25.0%) and 13 were rare (frequency ≤ 2.0%).Forty-five of the mutations in positively selected sites, reported for the full dataset, were not identified in Veneto (Table 1).
Evident differences in frequencies of specific mutations were highlighted between genomes from Lazio, Sicily and Veneto (Tables 1 and S3).

Discussion
The epidemic dynamics of COVID-19 in Italy and worldwide showed multiple waves, characterised by the emergence of different SARS-CoV-2 variants [2].
Delta variant (B.1.617.2) emerged as the dominant across multiple countries and was endowed with enhanced infectivity and antibody escape capacity for the presence of key amino acid substitutions in the spike protein [32].The Delta variant was associated with more severe infection, with patients more likely to be hospitalised and suffering longer infection course [33].
In Italy, Delta variant was dominant from mid-June until December 2021 [4,5]; afterward, Omicron variant became largely predominant [34,35].This study provides a genomic analysis on Delta variant Italian dataset as a tool to identify the positive, negative selection, the evolutionary dynamics, and the recurrent mutations that need to be closely monitored also on other future variants for potential implications in public health.
The gene flow approach could help to identify the structure of the dispersal pattern and intermixing [23,36].Overall, the study suggested that the gene flow of most of the SARS-CoV-2 Delta variant (36.1%) was from the northern to the southern part of Italy.
A similar percentage of gene flow (about 7.0%) was identified from central to northern of Italy and from southern to central.
The selective pressure analysis provided a large-scale genomic analysis towards understanding the selective pressure pattern on Italian Delta variant genomes.In addition, it allowed identification of the amino acid changes endowed of a selective advantage that were fixed at a faster rate than a synonymous mutation (positive selective pressure, ω > 1).
Most of the mutations identified in this study as positively selected sites, were also previously identified in other SARS-CoV-2 lineages internationally, as suggested by the genomes available in GISAID as of 20 October 2022 ([18], (last access 20 October 2022).
In particular, already starting from the first epidemic phase, some of them (i.e., the mutations in the spike protein V367F, D614G [37] or the mutation A222V) emerged since summer 2020 in the 20E_EU1 cluster of the SARS-CoV-2 virus, presumably in Spain and then in Europe [38].
The highest number of positive selected sites identified in the spike, followed by nsp2 and nsp12, suggested a possible evolutionary advantage to the virus, being specifically localised in regions or proteins with important functional roles (i.e., the receptor-binding domain RBD in the spike protein).
The alterations in RBD, hypothesised as modifying RBD-ACE2 affinity, are generally rare [41].Other authors suggested that the primary driver of positive selection arising from most mutations within the RBD is enhanced neutralisation resistance as opposed to increased affinity of S to ACE2 [41][42][43][44].
Some of the mutations detected at higher frequency in the full dataset were also confirmed at higher frequencies in genomes from the three representative areas (Lazio, Sicily, Veneto), such as G142D, L452R, D614G, P681R, D950N in the spike or the P323L in nsp12, probably indicating that these amino acid changes might favour viral adaptation.
An investigation of neutralising antibodies targeting the N-terminal domain (NTD) of the spike revealed a "supersite" for some known antibodies [45], considered a site of vulnerability for the SARS-CoV-2 virus.The T95I amino acid substitution does not occur close to the NTD neutralisation "supersite" but it was identified in our dataset as a positively selected site, with a frequency of about 20-21% among the sequences identified in Lazio, Veneto and about 12% in Sicily.A study performed on patients infected with SARS-CoV-2 showed an increased viral load for patients with variants showing the T95I [46].Structural modelling analysis revealed that topological changes may occur in the NTD "supersite" as a result of the T95I, suggesting an effect of alteration of the topology of the supersite and affecting SARS-CoV-2 neutralisation by sera from vaccinated persons [41,46], suggesting the need to monitor all the mutations in the NTD region.
None of the positively selected sites identified in this study were already reported and included by COG.UK/Mutation Explorer in the list of the mutations conferring resistance to antiviral therapies ([47], last access 20 October 2022).
Six of the twenty-four mutations identified in the spike protein (T95I, G142D, A222V, V367F, L452R, N501Y) were associated with a weaker neutralisation of the virus by convalescent plasma from people who have been infected with SARS-CoV-2 and/or by monoclonal antibodies that recognise the SARS-CoV-2 spike protein ("escape" mutations) according to COG.UK [47].Four of them were identified in the genomes from Lazio, Sicily and Veneto, and the N501Y only among sequences from Sicily.
Moreover, some of the amino acid changes were predicted to have a possible impact on the structure and stability of the proteins and need to be closely monitored.The detection of positive selection may represent an approach to identify key signature mutations.
Before drawing conclusions, limits and possible bias of the study have to be mentioned.This model is dependent on the availability of SARS-CoV-2 genomes -and on the limits imposed by the models used for the analysis.
The findings might provide a compendium of the SARS-CoV-2 mutations fixed at a faster rate, relative to synonymous changes and on the selective pressure profile in a Delta variant Italian dataset.
The selective pressure was probably the most likely reason for convergent evolution, that is different variants acquiring independently a group of recurrent mutations (i.e., residues K417, L452, E484, N501 and P681 of the spike for Alpha, Beta, Gamma and Delta variants or residues R346, K444, N450, N460, F486, F490, Q493 and S494 for Omicron and its sublineages) [47].
This study may update information on previous circulating SARS-CoV-2 strains, and help to track the presence of specific mutations in key viral genes.

Conclusions
In conclusion, this study provides a picture of the selective pressure profile and gene flows in a subset of Delta variant genomes identified in Italy, highlighting how specific mutations may become fixed in this viral population, how they affect the stability of the proteins, and, finally provides a model for recurrent mutations.

Figure 1 .
Figure 1.Maximum parsimony migration patterns of SARS-CoV-2 Delta variant genomes to/from different areas of Italy.The bubblegram shows the frequency of the statistically supported gene flow (migrations) identified to/from different geographic areas of Italy, as the percentage of the total observed migrations estimated from the maximum likelihood tree.A, north; B, centre; C, south.

Figure 1 .
Figure 1.Maximum parsimony migration patterns of SARS-CoV-2 Delta variant genomes to/from different areas of Italy.The bubblegram shows the frequency of the statistically supported gene flow (migrations) identified to/from different geographic areas of Italy, as the percentage of the total observed migrations estimated from the maximum likelihood tree.A, north; B, centre; C, south.

Table 1 .
Frequency of amino acid mutations identified as positively selected sites from SARS-CoV-2 Delta genomes collected from the full dataset and from three regions (total number n = 1500 and n = 596, of which 213 are from Lazio, 245 from Sicily and 138 from Veneto).